Introduction

Within the subject area of machine learning, gradient descent is perhaps the most well-known and commonly used iterative optimization method for finding the parameters that minimize error functions such as residual sum of squares for (\(L_1\)-regularized) linear regression, or negative log-likelihoods for logistic regression and neural networks.

However, there are a few alternative methods that may perform better than gradient descent in specific cases. For example:

  • Limited-memory BFGS (L-BFGS) is particularly suited to problems with very large numbers of variables1.
  • Coordinate descent is generally faster than gradient descent (despite converging in more iterations), and is frequently used for \(L_1\)-regularized least squares problems2.

In this notebook, we will be comparing coordinate and gradient descent in the bivariate function setting—functions only accepting two inputs. This comparison is aimed at understanding optimization methods more abstractly, rather than their applications to machine learning.

It is important to note that coordinate descent generally refers to the approach of minimizing a multivariate function by minimizing it in one direction or coordinate at a time (solving univariate optimization problems repeatedly). This means that it is possible to apply coordinate descent to both differentiable and derivative-free contexts3.

However, in this notebook, coordinate descent refers to gradient-based coordinate descent.

Setup

# Set working directory
setwd("~/Development/R/minimizers")

# Install dependencies
#install.packages("shiny")
#install.packages("plotly")
#install.packages("plyr")
#install.packages("ggplot2")
#install.packages("ggpubr")

# Require dependencies
require(shiny)
require(plotly)
require(plyr)
require(ggplot2)
require(ggpubr)

# Set seed for reproducibility
set.seed(9292)

Test functions

Comparison metrics will include benchmarks, number of iterations and convergence/divergence. These metrics will be used to test both optimization methods on two differentiable bivariate functions: the Six-hump camel function and the Peaks function.

Six-hump camel function

The Six-hump camel function is a test function used for investigating the performance of optimization algorithms.4

The function is defined by:

\[ F_1(x,y)=\left(4-2.1x^2+\frac{x^4}{3}\right)x^2+xy+4\left(y^2-1\right)y^2 \] Despite being well defined for all \((x,y)\in\mathbb{R}^2\), the region of interest for this function is \(x,y\in[-6,6]\), where all of its turning points lie. However, for visualization, this region can be made smaller: \(x\in[-2,2],\ y\in[-1,1]\).
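The implementation in functions/camel.R is not shown in this notebook; a direct transcription of the definition above (assuming the file defines F1 as a function of two numeric vectors) might look like:

```r
# Six-hump camel function, vectorized over x and y
F1 <- function(x, y) {
  (4 - 2.1 * x^2 + x^4 / 3) * x^2 + x * y + 4 * (y^2 - 1) * y^2
}

F1(0.089842, -0.712656)  # approximately -1.0316, the global minimum value
```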

# Import Six-hump camel function
source("functions/camel.R")

# Set input variable ranges
F1.x <- seq(-2, 2, length=100)
F1.y <- seq(-1, 1, length=100)
F1.z <- t(outer(F1.x, F1.y, F1))

# Display plot and contours
F1.plot <- plot_ly(x=F1.x, y=F1.y, z=F1.z, width=700, height=500) %>%
  add_surface(contours=list(z=list(show=TRUE, usecolormap=TRUE, highlightcolor="#ff0000", project=list(z=TRUE)))) %>%
  layout(scene=list(zaxis=list(range=c(-2,6)), aspectmode="manual", aspectratio=list(x=1, y=1, z=0.6)),
         title="Six-hump camel function (F1)")

div(F1.plot, align="center")

Minima

| Type | \(x\) | \(y\) | \(f(x,y)\) |
|------|-------|-------|------------|
| Global | \(0.089842\) | \(-0.712656\) | \(-1.031628453\) |
| Global | \(-0.089842\) | \(0.712656\) | \(-1.031628453\) |

Partial derivatives

The gradient vector \(\nabla F_1(x,y)\) for the Six-hump camel function consists of the following partial derivatives:

\[ \begin{align} \nabla F_1(x,y) &=\begin{pmatrix} \frac{\partial F_1}{\partial x}\\ \frac{\partial F_1}{\partial y} \end{pmatrix}\\ &=\begin{pmatrix} 2\left(x^5-4.2x^3+4x+0.5y\right) \\ x+16y^3-8y \end{pmatrix} \end{align} \]
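The gradient function F1.d used later in this notebook is external as well (presumably alongside F1 in functions/camel.R); a sketch consistent with the partial derivatives above would be:

```r
# Gradient of the Six-hump camel function: c(dF1/dx, dF1/dy)
F1.d <- function(x, y) {
  c(2 * (x^5 - 4.2 * x^3 + 4 * x + 0.5 * y),
    x + 16 * y^3 - 8 * y)
}

F1.d(0.089842, -0.712656)  # approximately c(0, 0) at the global minimum
```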

The Peaks function

The Peaks function5 is often used to demonstrate 3D graphing software and libraries in MATLAB and other programming languages.

The function is defined by:

\[ \begin{align} F_2(x,y)&= 3(1-x)^2e^{-x^2-(y+1)^2}\\ &-10\left(\frac{1}{5}x-x^3-y^5\right)e^{-\left(x^2+y^2\right)}\\ &-\frac{1}{3}e^{-(x+1)^2-y^2} \end{align} \]

For optimization tests, this function is normally restricted to the \(x,y\in[-4,4]\) region, as this is where the extrema of the function lie. However, for visualization purposes, the smaller region \(x,y\in[-3,3]\) is preferable.
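As with the camel function, functions/peaks.R is external to this notebook. A direct transcription of \(F_2\) (assuming the file defines F2 as a function of x and y, and following MATLAB's peaks definition, in which the second and third terms are subtracted, consistent with the global minimum value quoted below) might look like:

```r
# Peaks function, vectorized over x and y
F2 <- function(x, y) {
  3 * (1 - x)^2 * exp(-x^2 - (y + 1)^2) -
    10 * (x / 5 - x^3 - y^5) * exp(-(x^2 + y^2)) -
    (1 / 3) * exp(-(x + 1)^2 - y^2)
}

F2(0.228280, -1.625531)  # approximately -6.5511, the global minimum value
```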

# Import Peaks function
source("functions/peaks.R")

# Set input variable ranges
F2.x <- seq(-3, 3, length=100)
F2.y <- seq(-3, 3, length=100)
F2.z <- t(outer(F2.x, F2.y, F2))

# Display plot and contours
F2.plot <- plot_ly(x=F2.x, y=F2.y, z=F2.z, width=700, height=500) %>%
  add_surface(contours=list(z=list(show=TRUE, usecolormap=TRUE, highlightcolor="#ff0000", project=list(z=TRUE)))) %>%
  layout(scene=list(aspectmode="manual", aspectratio=list(x=1, y=1, z=0.65)),
         title="Peaks function (F2)")

div(F2.plot, align="center")

Minima

| Type | \(x\) | \(y\) | \(f(x,y)\) |
|------|-------|-------|------------|
| Global | \(0.2282799999\) | \(-1.6255310720\) | \(-6.5511333326\) |

Partial derivatives

The gradient vector \(\nabla F_2(x,y)\) for the Peaks function consists of the following partial derivatives:

\[ \begin{align} \nabla F_2(x,y) &=\begin{pmatrix} \frac{\partial F_2}{\partial x}\\ \frac{\partial F_2}{\partial y} \end{pmatrix}\\ &=\begin{pmatrix} -\frac{2}{3}e^{-\big(x^2 + 2 x + (y + 1)^2\big)} \Big( e^{2x+2y+1}\big( 30x^3-6x^2+30x(y^5-1)+3 \big) + 9e^{2x}\left( x^3-2x^2+1 \right) + (x+1)\left(-e^{2y}\right) \Big)\\ -\frac{2}{3}e^{-\big(x^2+2x+(y+1)^2\big)} \Big( 3ye^{2x+2y+1}\big( 10x^2-2x+5(2y^2-5)y^3 \big) + 9e^{2x}(1-x)^2(y+1) - ye^{2y} \Big) \end{pmatrix} \end{align} \]

These partial derivatives are far more complex than those of the Six-hump camel function. Since both of the optimization methods we are testing are gradient-based, benchmarks for this test function will likely take significantly longer.

With derivatives as complex as these (or when the function is not differentiable at all), it is often preferable to use derivative-free optimization methods6 such as the Nelder-Mead method7.
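Base R's optim() defaults to Nelder-Mead, which needs only function evaluations and no gradient. As a quick illustration on the Rosenbrock function (a standard test function, used here to avoid depending on the external function files):

```r
# Nelder-Mead (optim's default method) requires no gradient, only evaluations
rosenbrock <- function(p) (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2

result <- optim(c(-1.2, 1), rosenbrock)  # method = "Nelder-Mead" is the default
result$par  # close to the true minimum at (1, 1)
```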

Optimization methods

Gradient descent

Gradient descent is a gradient-based iterative optimization method. Gradient descent works by iteratively updating the current position by taking a small step in the direction of steepest descent8.

\[ \mathbf{x}_{i+1} \leftarrow \mathbf{x}_i - \eta \nabla F(\mathbf{x}_i) \]

Where:

  • \(\eta\) is the learning rate for the algorithm—the size of the step taken on each iteration.
  • \(\nabla F(\mathbf{x})\) is the gradient vector of \(F(\mathbf{x})\), which represents the direction of steepest ascent.

With a bivariate function \(F(x,y)\), we can update the vector \(\begin{pmatrix}x \\ y\end{pmatrix}\) with:

\[ \begin{pmatrix} x \\ y \end{pmatrix} \leftarrow \begin{pmatrix} x \\ y \end{pmatrix} -\eta\begin{pmatrix} \frac{\partial F}{\partial x} \\ \frac{\partial F}{\partial y} \end{pmatrix} \] Choosing the learning rate \(\eta\) is a recurring problem in optimization: too large a learning rate can overshoot the minimum and lead to divergence, while a very small learning rate requires many iterations to converge, which is costly in terms of time and CPU/memory usage.
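To see this effect concretely, consider a toy illustration (not one of our test functions): gradient descent on the univariate function \(f(x)=x^2\), whose derivative is \(2x\), starting from \(x=1\).

```r
# Final position after a number of gradient descent steps on f(x) = x^2
gd_final_x <- function(eta, steps = 20) {
  x <- 1
  for (i in 1:steps) x <- x - eta * 2 * x  # each step multiplies x by (1 - 2 * eta)
  x
}

gd_final_x(0.05)  # approximately 0.12: slow but steady progress towards 0
gd_final_x(1.1)   # approximately 38.3: |x| grows every step, so the method diverges
```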

In order to ensure convergence, we will use a small learning rate.

# Set a small learning rate
rate <- 0.05

It is also necessary to set a maximum number of iterations for iterative methods. In the event that the method diverges, it will still terminate execution. Likewise, if \(\eta\) is too small and too many iterations have passed without reaching a minimum, the method will stop.

In this case, the maximum number of iterations will be \(200\).

# Set maximum number of iterations
max_iterations <- 200

In this notebook, we will say that an optimization method converges to a minimum point \(\begin{pmatrix}x^* \\ y^*\end{pmatrix}\) if it reaches a point \(\begin{pmatrix}x \\ y\end{pmatrix}\) such that the Euclidean distance (\(L_2\) norm) between the two is less than some \(\epsilon\in\mathbb{R}^+\). In other words, if \(\begin{pmatrix}x \\ y\end{pmatrix}\) is in the \(\epsilon\)-neighbourhood of \(\begin{pmatrix}x^* \\ y^*\end{pmatrix}\): \[ \left|\left| \begin{pmatrix}x^* \\ y^*\end{pmatrix} - \begin{pmatrix}x \\ y\end{pmatrix} \right|\right|_2 < \epsilon \]
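The dist call used in the implementations below computes exactly this \(L_2\) norm; an equivalent explicit version (with a hypothetical point near the camel function's global minimum) would be:

```r
# Euclidean (L2) distance between two points, as used in the convergence check
l2_distance <- function(p, q) sqrt(sum((p - q)^2))

l2_distance(c(0.10, -0.70), c(0.089842, -0.712656))  # approximately 0.016, within an epsilon of 0.05
```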

# Set a small epsilon
epsilon <- 0.05

We define the gradient descent algorithm iteratively as:

# Gradient descent
#
# Parameters:
#   x.initial: The starting x-coordinate
#   y.initial: The starting y-coordinate
#   FUN: The gradient vector for the function being optimized
#   minimum: The actual minimum of the function
#
# Returns:
#   A character vector of (termination reason, iterations).
#
gradient.descent <- function(x.initial, y.initial, FUN, minimum) {
  # Add the starting point as the first entry in the path
  path[[1]] <<- c(x.initial, y.initial)
  
  for (iterations in 1:max_iterations) {
    # Retrieve the current position from the path
    x.i <- path[[iterations]][1]
    y.i <- path[[iterations]][2]
    
    # Check for convergence (within the epsilon-neighbourhood of the minimum)
    if (dist(rbind(c(x.i, y.i), minimum)) <= epsilon) return(c("Converged", iterations - 1))
    
    # Take a gradient step and store the updated position in the `path` variable
    path[[iterations + 1]] <<- c(x.i, y.i) - rate * FUN(x.i, y.i)
  }
  
  return(c("Maximum iterations reached", max_iterations))
}

Here, path is a global variable that stores all of the intermediate positions visited during the optimization process.

Testing on the Six-hump camel function

Before we start testing, we need a minimum of interest, along with a starting point for the optimization process:

# Minimum of interest
F1.minimum <- c(0.089842, -0.712656)

# Starting point
F1.start <- c(0.85, 0)

Additionally, we will need to initialize the path variable for keeping track of the updated vectors during the optimization process. This variable will be reused for other tests, so we will also define a reset function.

# Create an empty path vector
path <<- vector('list', max_iterations)
# Reset function for the `path` variable
reset_path <- function() { path <<- vector('list', max_iterations) }

Running the gradient descent method (with benchmarking):

# Run timed gradient descent
F1.gradient.time <- proc.time()
F1.gradient.results <- gradient.descent(F1.start[1], F1.start[2], F1.d, F1.minimum)
F1.gradient.time <- proc.time() - F1.gradient.time

Plotting the contour and path of updates:

# Convert the `path` variable to a matrix
path <- matrix(unlist(path), ncol=2, byrow=TRUE)

# Display contour and optimization path
F1.gradient.contour <- plot_ly(x=~F1.x, y=~F1.y, z=F1.z, width=800, height=500,
                      type="contour", contours=list(showlabels = TRUE)) %>%

  # Add optimization path trace
  add_trace(x=path[,1], y=path[,2], type="scatter", name="Optimization path", mode="lines") %>%

  # Add cross at target minimum position
  add_trace(x=F1.minimum[1], y=F1.minimum[2], type="scatter", name="Target minimum",
            mode="markers", marker=list(color="#5ec4a7", size=6, symbol="x")) %>%

  # Add arrow and annotation for starting position
  add_annotations(x=F1.start[1], y=F1.start[2], text="Start",
                  showarrow=TRUE, arrowhead=4, arrowsize=.5, arrowcolor='white', ax=30, ay=-30,
                  font=list(color = '#81ccc2', family='helvetica', size = 14)) %>%

  # Add arrow and annotation for target minimum
  add_annotations(x=F1.minimum[1], y=F1.minimum[2], text="Target\nminimum",
                  showarrow=TRUE, arrowhead=4, arrowsize=.5, arrowcolor='white', ax=-80, ay=-30,
                  font=list(color = '#81ccc2', family='helvetica', size = 14)) %>%

  # Adding and positioning plot title
  layout(title="GD optimization of the six-hump camel function (F1)", margin=list(l=125, r=0, t=60, b=60))

# Display the contour plot
div(F1.gradient.contour, align="center")

Termination reason and number of iterations:

cat(sprintf("Termination reason: %s\nNumber of iterations: %s", F1.gradient.results[1], F1.gradient.results[2]))
Termination reason: Converged
Number of iterations: 9

Benchmarking for these optimization methods will be done with the proc.time base function in R9. This function records the time taken for a specific action to finish execution, according to three different time measures:

  1. user: The CPU time charged for the execution of user instructions of the calling process.
  2. system: The CPU time charged for execution by the system on behalf of the calling process.
  3. elapsed: The “real” elapsed time since the process was started.

Printing the proc_time result displays these measures in order:

print(F1.gradient.time)
   user  system elapsed 
  0.198   0.084   0.293 

Testing on the Peaks function

Setting a minimum of interest, along with a starting point:

# Minimum of interest
F2.minimum <- c(0.228279999979237, -1.625531071954464)

# Starting point
F2.start <- c(1, -0.2)

Resetting the optimization path variable (path) and running the gradient descent method (with benchmarking):

# Resetting optimization path
reset_path()
# Run timed gradient descent
F2.gradient.time <- proc.time()
F2.gradient.results <- gradient.descent(F2.start[1], F2.start[2], F2.d, F2.minimum)
F2.gradient.time <- proc.time() - F2.gradient.time

Plotting the contour and path of updates:

# Convert the `path` variable to a matrix
path <- matrix(unlist(path), ncol=2, byrow=TRUE)

# Display contour and optimization path
F2.gradient.contour <- plot_ly(x=~F2.x, y=~F2.y, z=F2.z, width=700, height=600,
                      type="contour", contours=list(showlabels = TRUE)) %>%

  # Add optimization path trace
  add_trace(x=path[,1], y=path[,2], type="scatter", name="Optimization path", mode="lines") %>%

  # Add cross at target minimum position
  add_trace(x=F2.minimum[1], y=F2.minimum[2], type="scatter", name="Target minimum",
            mode="markers", marker=list(color="#5ec4a7", size=6, symbol="x")) %>%

  # Add arrow and annotation for starting position
  add_annotations(x=F2.start[1], y=F2.start[2], text="Start",
                  showarrow=TRUE, arrowhead=4, arrowsize=.5, arrowcolor='white', ax=-5, ay=-40,
                  font=list(color = '#81ccc2', family='helvetica', size = 14)) %>%

  # Add arrow and annotation for target minimum
  add_annotations(x=F2.minimum[1], y=F2.minimum[2], text="Target\nminimum",
                  showarrow=TRUE, arrowhead=4, arrowsize=.5, arrowcolor='white', ax=-80, ay=30,
                  font=list(color = '#81ccc2', family='helvetica', size = 14)) %>%

  # Adding and positioning plot title
  layout(title="GD optimization of the peaks function (F2)", margin=list(l=125, r=0, t=60, b=60))

# Display the contour plot
div(F2.gradient.contour, align="center")

Termination reason and number of iterations:

cat(sprintf("Termination reason: %s\nNumber of iterations: %s", F2.gradient.results[1], F2.gradient.results[2]))
Termination reason: Converged
Number of iterations: 8

Benchmarks:

print(F2.gradient.time)
   user  system elapsed 
  0.130   0.004   0.140 

Coordinate descent

Gradient-based coordinate descent is closely related to gradient descent, as each iteration still updates the current position by taking a step of size \(\eta\) based on the gradient.

However, the main difference with coordinate descent is that we optimize along one coordinate at a time, essentially solving \(d\) univariate optimization problems (if the function \(F\) is \(d\)-variate). The optimization paths from these sub-problems are then combined to form a solution to the original optimization problem.

For a single coordinate \(x_i\) of \(\mathbf{x}\), the update formula is given by: \[ x_i \leftarrow x_i - \eta \frac{\partial F}{\partial x_i}(\mathbf{x}) \]
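As a minimal sketch of this alternating scheme (on the toy function \(F(x,y)=x^2+y^2\) with gradient \((2x,2y)\), not one of our test functions):

```r
# Alternate between updating x (odd iterations) and y (even iterations)
toy_gradient <- function(x, y) c(2 * x, 2 * y)

position <- c(1, 1)
step_size <- 0.1
for (iteration in 1:50) {
  g <- toy_gradient(position[1], position[2])
  if (iteration %% 2 == 1) {
    position[1] <- position[1] - step_size * g[1]  # update the x-coordinate only
  } else {
    position[2] <- position[2] - step_size * g[2]  # update the y-coordinate only
  }
}
position  # both coordinates approach the minimum at (0, 0)
```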

We define the coordinate descent algorithm iteratively as:

# Coordinate descent
#
# Parameters:
#   x.initial: The starting x-coordinate
#   y.initial: The starting y-coordinate
#   FUN: The gradient vector for the function being optimized
#   minimum: The actual minimum of the function
#
# Returns:
#   A character vector of (termination reason, iterations).
#
coordinate.descent <- function(x.initial, y.initial, FUN, minimum) {
  # Add the starting point as the first entry in the path
  path[[1]] <<- c(x.initial, y.initial)
  
  for (iterations in 1:max_iterations) {
    # Retrieve the current position from the path
    x.i <- path[[iterations]][1]
    y.i <- path[[iterations]][2]
    
    # Check for convergence (within the epsilon-neighbourhood of the minimum)
    if (dist(rbind(c(x.i, y.i), minimum)) <= epsilon) return(c("Converged", iterations - 1))
    
    # Update one coordinate at a time, alternating between x and y
    gradient <- FUN(x.i, y.i)
    if (iterations %% 2 == 1) {
      updated <- c(x.i, y.i) - rate * c(gradient[1], 0)
    } else {
      updated <- c(x.i, y.i) - rate * c(0, gradient[2])
    }
    
    # Store the updated position in the `path` variable
    path[[iterations + 1]] <<- updated
  }
  
  return(c("Maximum iterations reached", max_iterations))
}

Testing on the Six-hump camel function

Running the coordinate descent method (with benchmarking):

# Reset optimization path
reset_path()
# Run timed coordinate descent
F1.coordinate.time <- proc.time()
F1.coordinate.results <- coordinate.descent(F1.start[1], F1.start[2], F1.d, F1.minimum)
F1.coordinate.time <- proc.time() - F1.coordinate.time

# Retrieve number of iterations
F1.coordinate.iterations <- as.numeric(F1.coordinate.results[[2]])

Termination reason and number of iterations for coordinate descent optimization:

cat(sprintf("Termination reason: %s\nNumber of iterations: %s", F1.coordinate.results[1], F1.coordinate.results[2]))
Termination reason: Converged
Number of iterations: 18

Benchmarks for coordinate descent optimization:

print(F1.coordinate.time)
   user  system elapsed 
  0.064   0.002   0.067 

Plotting the contour and path of updates:

# Convert the `path` variable to a matrix
path <- matrix(unlist(path), ncol=2, byrow=TRUE)

# Display contour and optimization path
F1.coordinate.contour <- plot_ly(x=~F1.x, y=~F1.y, z=F1.z, width=800, height=500,
                      type="contour", contours=list(showlabels = TRUE)) %>%

  # Add optimization path trace
  add_trace(x=path[,1], y=path[,2], type="scatter", name="Optimization path", mode="lines") %>%

  # Add cross at target minimum position
  add_trace(x=F1.minimum[1], y=F1.minimum[2], type="scatter", name="Target minimum",
            mode="markers", marker=list(color="#5ec4a7", size=6, symbol="x")) %>%

  # Add arrow and annotation for starting position
  add_annotations(x=F1.start[1], y=F1.start[2], text="Start",
                  showarrow=TRUE, arrowhead=4, arrowsize=.5, arrowcolor='white', ax=30, ay=-30,
                  font=list(color = '#81ccc2', family='helvetica', size = 14)) %>%

  # Add arrow and annotation for target minimum
  add_annotations(x=F1.minimum[1], y=F1.minimum[2], text="Target\nminimum",
                  showarrow=TRUE, arrowhead=4, arrowsize=.5, arrowcolor='white', ax=-80, ay=-30,
                  font=list(color = '#81ccc2', family='helvetica', size = 14)) %>%

  # Adding and positioning plot title
  layout(title="CD optimization of the six-hump camel function (F1)", margin=list(l=125, r=0, t=60, b=60))

# Display the contour plot
div(F1.coordinate.contour, align="center")

Testing on the Peaks function

Running the coordinate descent method (with benchmarking):

# Reset optimization path
reset_path()
# Run timed coordinate descent
F2.coordinate.time <- proc.time()
F2.coordinate.results <- coordinate.descent(F2.start[1], F2.start[2], F2.d, F2.minimum)
F2.coordinate.time <- proc.time() - F2.coordinate.time

# Remove NULL entries from the optimization path list
F2.coordinate.path <- Filter(Negate(is.null), path)

# Retrieve number of iterations
F2.coordinate.iterations <- as.numeric(F2.coordinate.results[[2]])

Termination reason and number of iterations for coordinate descent optimization:

cat(sprintf("Termination reason: %s\nNumber of iterations: %s", F2.coordinate.results[1], F2.coordinate.results[2]))
Termination reason: Converged
Number of iterations: 18

Benchmarks for coordinate descent optimization:

print(F2.coordinate.time)
   user  system elapsed 
  0.045   0.002   0.047 

Plotting the contour and path of updates:

# Convert the `path` variable to a matrix
path <- matrix(unlist(path), ncol=2, byrow=TRUE)

# Display contour and optimization path
F2.coordinate.contour <- plot_ly(x=~F2.x, y=~F2.y, z=F2.z, width=700, height=600,
                      type="contour", contours=list(showlabels = TRUE)) %>%

  # Add optimization path trace
  add_trace(x=path[,1], y=path[,2], type="scatter", name="Optimization path", mode="lines") %>%

  # Add cross at target minimum position
  add_trace(x=F2.minimum[1], y=F2.minimum[2], type="scatter", name="Target minimum",
            mode="markers", marker=list(color="#5ec4a7", size=6, symbol="x")) %>%

  # Add arrow and annotation for starting position
  add_annotations(x=F2.start[1], y=F2.start[2], text="Start",
                  showarrow=TRUE, arrowhead=4, arrowsize=.5, arrowcolor='white', ax=30, ay=-30,
                  font=list(color = '#81ccc2', family='helvetica', size = 14)) %>%

  # Add arrow and annotation for target minimum
  add_annotations(x=F2.minimum[1], y=F2.minimum[2], text="Target\nminimum",
                  showarrow=TRUE, arrowhead=4, arrowsize=.5, arrowcolor='white', ax=-80, ay=-30,
                  font=list(color = '#81ccc2', family='helvetica', size = 14)) %>%

  # Adding and positioning plot title
  layout(title="CD optimization of the peaks function (F2)", margin=list(l=125, r=0, t=60, b=60))

# Display the contour plot
div(F2.coordinate.contour, align="center")

Method comparison metrics

Benchmarks

Preparing the benchmark comparison plot:

# Dataframe representing the gradient and coordinate descent benchmarks on F1
F1.benchmarks.df <- data.frame(
  Method=rep(c("Gradient descent", "Coordinate descent"), each=3),
  Type=c("User", "System", "Elapsed (Total)"),
  Time=round(c(F1.gradient.time[1:3], F1.coordinate.time[1:3]), digits=3)
)

# Grouped bar plot for comparison of the benchmarks
F1.benchmarks.plot <- ggplot(data=F1.benchmarks.df, aes(x=Type, y=Time, fill=Method)) +
  geom_bar(stat="identity", position="dodge") +
  geom_text(aes(label=Time), hjust=0.5, color="black", size=2.5, position = position_dodge(0.9)) +
  scale_fill_manual("Method", values=c("Gradient descent"="#70bacf", "Coordinate descent"="#eb6963")) +
  #scale_fill_brewer(palette="Paired") +
  theme_minimal() +
  ggtitle("F1 (six-hump camel function)") +
  theme(plot.margin=unit(c(0.5, 0, 0, 0), "cm")) +
  coord_flip()

# Dataframe representing the gradient and coordinate descent benchmarks on F2
F2.benchmarks.df <- data.frame(
  Method=rep(c("Gradient descent", "Coordinate descent"), each=3),
  Type=c("User", "System", "Elapsed (Total)"),
  Time=round(c(F2.gradient.time[1:3], F2.coordinate.time[1:3]), digits=3)
)

# Grouped bar plot for comparison of the benchmarks
F2.benchmarks.plot <- ggplot(data=F2.benchmarks.df, aes(x=Type, y=Time, fill=Method)) +
  geom_bar(stat="identity", position="dodge") +
  geom_text(aes(label=Time), hjust=0.5, color="black", size=2.5, position = position_dodge(0.9)) +
  scale_fill_manual("Method", values=c("Gradient descent"="#70bacf", "Coordinate descent"="#eb6963")) +
  #scale_fill_brewer(palette="Paired") +
  theme_minimal() +
  ggtitle("F2 (peaks function)") +
  coord_flip()

# Arranging the subplots
benchmarks.plot <- ggarrange(F1.benchmarks.plot, F2.benchmarks.plot, nrow=2, common.legend=TRUE, legend="bottom")

Plotting the benchmarks comparison:

# Add a plot title
annotate_figure(benchmarks.plot, 
                top=text_grob("Gradient and coordinate descent benchmarks", size=16, face="bold"))

As explained earlier, coordinate descent tends to run faster (in terms of wall-clock and CPU time) than gradient descent. Despite this, we cannot definitively conclude that coordinate descent will always be faster without performing repeated experiments, or without trying other functions and combinations of \(\eta\), \(\epsilon\) and starting position.

Number of iterations

Preparing the iteration comparison plot:

# Dataframe representing the gradient and coordinate descent iterations on F1
F1.iterations.df <- data.frame(
  Method=c("Gradient descent", "Coordinate descent"),
  Iterations=round(as.numeric(c(F1.gradient.results[[2]], F1.coordinate.iterations)))
)

# Bar plot for comparison of iterations
F1.iterations.plot <- ggplot(data=F1.iterations.df, aes(x=Method, y=Iterations, fill=Method)) +
  geom_bar(stat="identity", position="dodge") +
  scale_fill_manual("Method", values=c("Gradient descent"="#70bacf", "Coordinate descent"="#eb6963")) +
  #scale_fill_brewer(palette="Paired") +
  geom_text(aes(label=Iterations), hjust=3, color="white", size=4, position = position_dodge(0.9)) +
  scale_y_continuous(breaks=function(x) unique(floor(pretty(seq(0, (max(x) + 1) * 1.1))))) + 
  theme_minimal() +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) +
  ggtitle("F1 (six-hump camel function)") +
  theme(plot.margin=unit(c(0.5, 0, 0, 0), "cm")) +
  coord_flip()

# Dataframe representing the gradient and coordinate descent iterations on F2
F2.iterations.df <- data.frame(
  Method=c("Gradient descent", "Coordinate descent"),
  Iterations=round(as.numeric(c(F2.gradient.results[[2]], F2.coordinate.iterations)))
)

# Bar plot for comparison of iterations
F2.iterations.plot <- ggplot(data=F2.iterations.df, aes(x=Method, y=Iterations, fill=Method)) +
  geom_bar(stat="identity", position="dodge") +
  scale_fill_manual("Method", values=c("Gradient descent"="#70bacf", "Coordinate descent"="#eb6963")) +
  #scale_fill_brewer(palette="Paired") +
  geom_text(aes(label=Iterations), hjust=3, color="white", size=4, position = position_dodge(0.9)) +
  scale_y_continuous(breaks=function(x) unique(floor(pretty(seq(0, (max(x) + 1) * 1.1))))) + 
  theme_minimal() +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) +
  ggtitle("F2 (peaks function)") +
  coord_flip()

# Arranging the subplots
iterations.plot <- ggarrange(F1.iterations.plot, F2.iterations.plot, nrow=2, common.legend=TRUE, legend="bottom")

Plotting the iterations comparison:

# Add a plot title
annotate_figure(iterations.plot, 
                top=text_grob("Gradient and coordinate descent iterations", size=16, face="bold"))

Unlike benchmarks, the number of iterations that each of these optimization methods takes to converge (if it converges at all) does not vary between runs: it will never change unless we change \(\eta\), \(\epsilon\) or the starting position, which we did not. For this reason, repeated experiments are not as necessary as they are for benchmarking, where variability in CPU resource allocation alone is enough to warrant repeated runs.

However, it still isn’t fair to make a concrete conclusion from these results, since we would need to test the optimization methods on different objective functions, different \(\eta\), \(\epsilon\) and starting positions.

Nevertheless, coordinate descent needed roughly twice as many iterations as gradient descent to optimize the objective functions (18 versus 9 for \(F_1\), and 18 versus 8 for \(F_2\)).

Resources


  1. https://en.wikipedia.org/wiki/Limited-memory_BFGS

  2. https://web.stanford.edu/~hastie/TALKS/glmnet.pdf

  3. https://en.wikipedia.org/wiki/Coordinate_descent

  4. https://www.sfu.ca/~ssurjano/camel6.html

  5. https://al-roomi.org/benchmarks/unconstrained/2-dimensions/63-peaks-function

  6. https://en.wikipedia.org/wiki/Derivative-free_optimization

  7. https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method

  8. https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html

  9. https://stat.ethz.ch/R-manual/R-devel/library/base/html/proc.time.html

---
title: Comparison of gradient-based coordinate descent and gradient descent on continuous bivariate functions
output:
  html_notebook:
    code_folding: none
    theme: spacelab
    toc: yes
---

# Introduction

Within the subject area of machine learning, gradient descent is perhaps the most well-known and commonly used iterative optimization method for finding the parameters that minimize error functions such as residual sum of squares for ($L_1$-regularized) linear regression, or negative log-likelihoods for logistic regression and neural networks.

However, there are a few alternative methods that may perform better than gradient descent in specific cases. For example:

- **Limited-memory BFGS (L-BFGS)** is particularly suited to problems with very large numbers of variables[^l-bfgs].
- **Coordinate descent** is generally faster than gradient descent (despite converging in more iterations), and is frequently used for $L_1$-regularized least squares problems[^glmnet].

---

In this notebook, we will be comparing coordinate and gradient descent in the bivariate function setting---functions only accepting two inputs. This comparison is aimed at understanding optimization methods more abstractly, rather than their applications to machine learning.

It is important to note that coordinate descent generally refers to the approach of minimizing a multivariate function by minimizing it in one direction or coordinate at a time (solving univariate optimization problems repeatedly). This means that it is possible to apply coordinate descent to both differentiable and derivative-free contexts[^coordinate-descent].

However, in this notebook, coordinate descent refers to gradient-based coordinate descent.

## Setup

```{r}
# Set working directory
setwd("~/Development/R/minimizers")

# Install dependencies
#install.packages("shiny")
#install.packages("plotly")
#install.packages("plyr")
#install.packages("ggplot2")
#install.packages("ggpubr")

# Require dependencies
require(shiny)
require(plotly)
require(plyr)
require(ggplot2)
require(ggpubr)

# Set seed for reproducibility
set.seed(9292)
```

## Test functions

Comparison metrics will include benchmarks, number of iterations and convergence/divergence. These metrics will be used to test both optimization methods on two differentiable bivariate functions: the **Six-hump camel function** and the **Peaks function**.

### Six-hump camel function

The Six-hump camel function is a test function used for investigating the performance of optimization algorithms.[^camel]

The function is defined by:

$$
F_1(x,y)=\left(4-2.1x^2+\frac{x^4}{3}\right)x^2+xy+4\left(y^2-1\right)y^2
$$
Despite being well defined for all $(x,y)\in\mathbb{R}^2$, the focal point of this function is within the $x,y\in[-6,6]$ region, where all of the turning points of the function lie. However, for visualization, this region can be made smaller: $x\in[-2,2],\ y\in[-1,1]$.

```{r}
# Import Six-hump camel function
source("functions/camel.R")

# Set input variable ranges
F1.x <- seq(-2, 2, length=100)
F1.y <- seq(-1, 1, length=100)
F1.z <- t(outer(F1.x, F1.y, F1))

# Display plot and contours
F1.plot <- plot_ly(x=F1.x, y=F1.y, z=F1.z, width=700, height=500) %>%
  add_surface(contours=list(z=list(show=TRUE, usecolormap=TRUE, highlightcolor="#ff0000", project=list(z=TRUE)))) %>%
  layout(scene=list(zaxis=list(range=c(-2,6)), aspectmode="manual", aspectratio=list(x=1, y=1, z=0.6)),
         title="Six-hump camel function (F1)")

div(F1.plot, align="center")
```

#### Minima

| Type | $x$ | $y$ | $f(x,y)$ |
|------|-----|-----|----------|
|Global|$0.089842$|$-0.712656$|$-1.031628453$|
|Global|$-0.089842$|$0.712656$|$-1.031628453$|

#### Partial derivatives

The gradient vector $\nabla F_1(x,y)$ for the Six-hump camel function consists of the following partial derivatives:

$$
\begin{align}
  \nabla F_1(x,y)
  &=\begin{pmatrix}
    \frac{\partial F_1}{\partial x}\\
    \frac{\partial F_1}{\partial y}
  \end{pmatrix}\\
  &=\begin{pmatrix}
    2\left(x^5-4.2x^3+4x+0.5y\right)
    \\
    x+16y^3-8y
  \end{pmatrix}
\end{align}
$$

### The Peaks function

The Peaks function[^peaks] is often used to demonstrate 3D graphing software and libraries in MATLAB and other programming languages.

The function is defined by:

$$
\begin{align}
  F_2(x,y)&=
  3(1-x)^2e^{-x^2-(y+1)^2}\\
  &-10\left(\frac{1}{5}x-x^3-y^5\right)e^{-\left(x^2+y^2\right)}\\
  &-\frac{1}{3}e^{-(x+1)^2-y^2}
\end{align}
$$

For optimization tests, this function is normally restricted to the $x,y\in[-4,4]$ region, as this is where the extrema of the function lie. However, for visualization purposes, $x,y\in[-3,3]$ is preferable.

```{r}
# Import Peaks function
source("functions/peaks.R")

# Set input variable ranges
F2.x <- seq(-3, 3, length=100)
F2.y <- seq(-3, 3, length=100)
F2.z <- t(outer(F2.x, F2.y, F2))

# Display plot and contours
F2.plot <- plot_ly(x=F2.x, y=F2.y, z=F2.z, width=700, height=500) %>%
  add_surface(contours=list(z=list(show=TRUE, usecolormap=TRUE, highlightcolor="#ff0000", project=list(z=TRUE)))) %>%
  layout(scene=list(aspectmode="manual", aspectratio=list(x=1, y=1, z=0.65)),
         title="Peaks function (F2)")

div(F2.plot, align="center")
```

#### Minima

| Type | $x$ | $y$ | $f(x,y)$ |
|------|-----|-----|----------|
|Global|$0.2282799999$|$-1.6255310720$|$-6.5511333326$|

#### Partial derivatives

The gradient vector $\nabla F_2(x,y)$ for the Peaks function consists of the following partial derivatives:

$$
\begin{align}
  \nabla F_2(x,y)
  &=\begin{pmatrix}
    \frac{\partial F_2}{\partial x}\\
    \frac{\partial F_2}{\partial y}
  \end{pmatrix}\\
  &=\begin{pmatrix}
    -\frac{2}{3}e^{-\big(x^2+2x+(y+1)^2\big)}
    \Big(
      3e^{2x+2y+1}\big(
        10x^4-17x^2+10xy^5+1
      \big)
      +
      9e^{2x}\left(
        x^3-2x^2+1
      \right)
      +
      (x+1)\left(-e^{2y}\right)
    \Big)\\
    -\frac{2}{3}e^{-\big(x^2+2x+(y+1)^2\big)}
    \Big(
      3ye^{2x+2y+1}\big(
        10x^3-2x+5(2y^2-5)y^3
      \big)
      +
      9e^{2x}(1-x)^2(y+1)
      -
      ye^{2y}
    \Big)
  \end{pmatrix}
\end{align}
$$

These partial derivatives are far more complex than those of the Six-hump camel function. Since both of the optimization methods we are testing are gradient-based, every iteration is more expensive to evaluate, so benchmarks for this test function will likely take significantly longer.
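The sourced file `functions/peaks.R` is likewise not reproduced here. A minimal sketch, assuming the standard MATLAB form of the Peaks function, can compute the gradient `F2.d` term by term instead of transcribing the factored expressions:

```r
# Hypothetical sketch of functions/peaks.R (the actual sourced file is not shown)

# Peaks function in its standard MATLAB form
F2 <- function(x, y) {
  3 * (1 - x)^2 * exp(-x^2 - (y + 1)^2) -
    10 * (x / 5 - x^3 - y^5) * exp(-x^2 - y^2) -
    (1 / 3) * exp(-(x + 1)^2 - y^2)
}

# Gradient of F2, differentiated term by term: returns (dF2/dx, dF2/dy)
F2.d <- function(x, y) {
  A <- exp(-x^2 - (y + 1)^2)   # shared exponential factors
  B <- exp(-x^2 - y^2)
  C <- exp(-(x + 1)^2 - y^2)
  dx <- -6 * (1 - x) * A - 6 * x * (1 - x)^2 * A -
    10 * (1 / 5 - 3 * x^2) * B + 20 * x * (x / 5 - x^3 - y^5) * B +
    (2 / 3) * (x + 1) * C
  dy <- -6 * (1 - x)^2 * (y + 1) * A +
    50 * y^4 * B + 20 * y * (x / 5 - x^3 - y^5) * B +
    (2 / 3) * y * C
  c(dx, dy)
}
```

At the global minimum listed above, `F2` evaluates to roughly $-6.5511$ and both components of `F2.d` vanish.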

With derivatives as complex as these (or when the function is not differentiable at all), it is often preferable to use derivative-free optimization methods[^derivative-free] such as the Nelder-Mead method[^nelder-mead].

## Optimization methods

### Gradient descent

Gradient descent is a gradient-based iterative optimization method: at each iteration, the current position is updated by taking a small step in the direction of steepest descent[^gradient-descent].

$$
\mathbf{x}_{i+1} \leftarrow \mathbf{x}_i - \eta \nabla F(\mathbf{x}_i)
$$

> **Where**:
>
> - $\eta$ is the learning rate for the algorithm---the size of the step taken on each iteration.
> - $\nabla F$ is the gradient of $F$, which points in the direction of steepest **ascent**---hence the subtraction.

With a bivariate function $F(x,y)$, we can update the vector $\begin{pmatrix}x \\ y\end{pmatrix}$ with:

$$
\begin{pmatrix}
  x \\ y
\end{pmatrix}
\leftarrow
\begin{pmatrix}
  x \\ y
\end{pmatrix}
-\eta\begin{pmatrix}
  \frac{\partial F}{\partial x} \\
  \frac{\partial F}{\partial y}
\end{pmatrix}
$$
Choosing the learning rate $\eta$ is a recurring problem in optimization---too large a learning rate can overshoot the minimum and even diverge, while too small a learning rate requires many iterations to converge, which can be costly in terms of time and CPU/memory usage.
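As a toy illustration (separate from the experiments in this notebook), consider gradient descent on the univariate function $f(x)=x^2$, whose derivative is $2x$. A small rate shrinks the iterate towards the minimum at $0$, while too large a rate makes it oscillate with growing magnitude:

```r
# Toy example: gradient descent on f(x) = x^2, with derivative f'(x) = 2x
gd_1d <- function(eta, steps = 50, x0 = 1) {
  x <- x0
  for (i in 1:steps) x <- x - eta * 2 * x  # x <- x - eta * f'(x)
  x
}

abs(gd_1d(0.1))  # converges: |x| shrinks by a factor of 0.8 each step
abs(gd_1d(1.1))  # diverges: |x| grows by a factor of 1.2 each step
```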

To make convergence likely, we will use a small learning rate.
```{r}
# Set a small learning rate
rate <- 0.05
```

It is also necessary to set a maximum number of iterations for iterative methods. In the event that the method diverges, it will still terminate execution. Likewise, if $\eta$ is too small and too many iterations have passed without reaching a minimum, the method will stop.

In this case, the maximum number of iterations will be $200$.
```{r}
# Set maximum number of iterations
max_iterations <- 200
```

In this notebook, we will say that an optimization method converges to a minimum point $\begin{pmatrix}x^* \\ y^*\end{pmatrix}$ if it reaches a point $\begin{pmatrix}x \\ y\end{pmatrix}$ such that the Euclidean distance ($L_2$ norm) between the two is less than some $\epsilon\in\mathbb{R}^+$. In other words, if $\begin{pmatrix}x \\ y\end{pmatrix}$ is in the $\epsilon$-neighbourhood of $\begin{pmatrix}x^* \\ y^*\end{pmatrix}$:
$$
\left|\left|
  \begin{pmatrix}x^* \\ y^*\end{pmatrix}
  -
  \begin{pmatrix}x \\ y\end{pmatrix}
\right|\right|_2 < \epsilon
$$
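Written out in R, this convergence test is just the $L_2$ norm of the difference vector; the implementations below use base R's `dist`, which computes the same quantity. The points here are illustrative values only:

```r
# L2 (Euclidean) distance between an illustrative current point and target minimum
current <- c(0.10, -0.70)
target  <- c(0.089842, -0.712656)

sqrt(sum((current - target)^2))           # explicit L2 norm
as.numeric(dist(rbind(current, target)))  # the same value via base R's dist()
```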
```{r}
# Set a small epsilon
epsilon <- 0.05
```

We define the gradient descent algorithm iteratively as:
```{r}
# Gradient descent
#
# Parameters:
#   x.initial: The starting x-coordinate
#   y.initial: The starting y-coordinate
#   FUN: The gradient vector for the function being optimized
#   minimum: The actual minimum of the function
#
# Returns:
#   A character vector of (termination reason, iteration count).
#
gradient.descent <- function(x.initial, y.initial, FUN, minimum) {
  for (iterations in 1:max_iterations) {
    # Add the starting point to the path on the first iteration
    if (iterations == 1) path[[1]] <<- c(x.initial, y.initial)
    
    # Retrieve the current position from the path list
    x.i <- path[[iterations]][1]
    y.i <- path[[iterations]][2]
    
    # Check for convergence to the target minimum
    if (dist(rbind(c(x.i, y.i), minimum)) <= epsilon) return(c("Converged", iterations - 1))
    
    # Take a step against the gradient and store the new position in `path`
    path[[iterations + 1]] <<- c(x.i, y.i) - rate * FUN(x.i, y.i)
  }
  
  return(c("Maximum iterations reached", max_iterations))
}
```

> **Where**: `path` is a global variable that stores all of the intermediate vectors during the optimization process.

#### Testing on the Six-hump camel function

Before we start testing, we need a minimum of interest, along with a starting point for the optimization process:
```{r}
# Minimum of interest
F1.minimum <- c(0.089842, -0.712656)

# Starting point
F1.start <- c(0.85, 0)
```

Additionally, we will need to initialize the `path` variable for keeping track of the updated vectors during the optimization process. This variable will be reused for other tests, so we will also define a reset function.
```{r}
# Create an empty optimization path list
path <<- vector('list', max_iterations)
# Reset function for the `path` variable
reset_path <- function() { path <<- vector('list', max_iterations) }
```

Running the gradient descent method (with benchmarking):
```{r}
# Run timed gradient descent
F1.gradient.time <- proc.time()
F1.gradient.results <- gradient.descent(F1.start[1], F1.start[2], F1.d, F1.minimum)
F1.gradient.time <- proc.time() - F1.gradient.time
```

Plotting the contour and path of updates:
```{r, warning=FALSE, message=FALSE}
# Convert the `path` variable to a matrix
path <- matrix(unlist(path), ncol=2, byrow=TRUE)

# Display contour and optimization path
F1.gradient.contour <- plot_ly(x=~F1.x, y=~F1.y, z=F1.z, width=800, height=500,
                      type="contour", contours=list(showlabels = TRUE)) %>%

  # Add optimization path trace
  add_trace(x=path[,1], y=path[,2], type="scatter", name="Optimization path", mode="lines") %>%

  # Add cross at target minimum position
  add_trace(x=F1.minimum[1], y=F1.minimum[2], type="scatter", name="Target minimum",
            mode="markers", marker=list(color="#5ec4a7", size=6, symbol="x")) %>%

  # Add arrow and annotation for starting position
  add_annotations(x=F1.start[1], y=F1.start[2], text="Start",
                  showarrow=TRUE, arrowhead=4, arrowsize=.5, arrowcolor='white', ax=30, ay=-30,
                  font=list(color = '#81ccc2', family='helvetica', size = 14)) %>%

  # Add arrow and annotation for target minimum
  add_annotations(x=F1.minimum[1], y=F1.minimum[2], text="Target\nminimum",
                  showarrow=TRUE, arrowhead=4, arrowsize=.5, arrowcolor='white', ax=-80, ay=-30,
                  font=list(color = '#81ccc2', family='helvetica', size = 14)) %>%

  # Adding and positioning plot title
  layout(title="GD optimization of the six-hump camel function (F1)", margin=list(l=125, r=0, t=60, b=60))

# Display the contour plot
div(F1.gradient.contour, align="center")
```

Termination reason and number of iterations:
```{r}
cat(sprintf("Termination reason: %s\nNumber of iterations: %s", F1.gradient.results[1], F1.gradient.results[2]))
```

Benchmarking for these optimization methods is done with the `proc.time` base function in R[^proc-time]. Taking the difference between two `proc.time` calls gives the time a specific action took to execute, according to three different time measures:

1. `user`: The CPU time charged for the execution of user instructions of the calling process.
2. `system`: The CPU time charged for execution by the system on behalf of the calling process.
3. `elapsed`: The "real" elapsed time since the process was started.

Printing the `proc_time` result displays these measures in order:
```{r}
print(F1.gradient.time)
```

#### Testing on the Peaks function

Setting a minimum of interest, along with a starting point:
```{r}
# Minimum of interest
F2.minimum <- c(0.228279999979237, -1.625531071954464)

# Starting point
F2.start <- c(1, -0.2)
```

Resetting the optimization path variable (`path`) and running the gradient descent method (with benchmarking):
```{r}
# Resetting optimization path
reset_path()

# Run timed gradient descent
F2.gradient.time <- proc.time()
F2.gradient.results <- gradient.descent(F2.start[1], F2.start[2], F2.d, F2.minimum)
F2.gradient.time <- proc.time() - F2.gradient.time
```

Plotting the contour and path of updates:
```{r, warning=FALSE, message=FALSE}
# Convert the `path` variable to a matrix
path <- matrix(unlist(path), ncol=2, byrow=TRUE)

# Display contour and optimization path
F2.gradient.contour <- plot_ly(x=~F2.x, y=~F2.y, z=F2.z, width=700, height=600,
                      type="contour", contours=list(showlabels = TRUE)) %>%

  # Add optimization path trace
  add_trace(x=path[,1], y=path[,2], type="scatter", name="Optimization path", mode="lines") %>%

  # Add cross at target minimum position
  add_trace(x=F2.minimum[1], y=F2.minimum[2], type="scatter", name="Target minimum",
            mode="markers", marker=list(color="#5ec4a7", size=6, symbol="x")) %>%

  # Add arrow and annotation for starting position
  add_annotations(x=F2.start[1], y=F2.start[2], text="Start",
                  showarrow=TRUE, arrowhead=4, arrowsize=.5, arrowcolor='white', ax=-5, ay=-40,
                  font=list(color = '#81ccc2', family='helvetica', size = 14)) %>%

  # Add arrow and annotation for target minimum
  add_annotations(x=F2.minimum[1], y=F2.minimum[2], text="Target\nminimum",
                  showarrow=TRUE, arrowhead=4, arrowsize=.5, arrowcolor='white', ax=-80, ay=30,
                  font=list(color = '#81ccc2', family='helvetica', size = 14)) %>%

  # Adding and positioning plot title
  layout(title="GD optimization of the peaks function (F2)", margin=list(l=125, r=0, t=60, b=60))

# Display the contour plot
div(F2.gradient.contour, align="center")
```

Termination reason and number of iterations:
```{r}
cat(sprintf("Termination reason: %s\nNumber of iterations: %s", F2.gradient.results[1], F2.gradient.results[2]))
```

Benchmarks:
```{r}
print(F2.gradient.time)
```

### Coordinate descent

Gradient-based coordinate descent is similar to gradient descent: each iteration still updates the position by taking a step of size $\eta$ against the gradient.

The main difference is that coordinate descent moves along only one coordinate at a time---repeatedly solving univariate optimization sub-problems, cycling through the $d$ coordinates of a $d$-variate function $F$. Together, these single-coordinate updates form a solution path for the original optimization problem.

For a single coordinate $j$ (holding all other coordinates fixed), the iterative update is:
$$
x_j \leftarrow x_j - \eta \frac{\partial F}{\partial x_j}(\mathbf{x})
$$
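As a toy sketch (independent of the `coordinate.descent` implementation below), alternating single-coordinate gradient steps on $f(x,y)=x^2+10y^2$ drive both coordinates towards the minimizer $(0,0)$:

```r
# Toy example: coordinate descent on f(x, y) = x^2 + 10 * y^2
# Gradient of f: (2x, 20y)
f_grad <- function(p) c(2 * p[1], 20 * p[2])

p <- c(1, 1)  # starting point
eta <- 0.05   # step size
for (i in 1:200) {
  g <- f_grad(p)
  if (i %% 2 == 1) {
    p[1] <- p[1] - eta * g[1]  # odd iterations: step along x
  } else {
    p[2] <- p[2] - eta * g[2]  # even iterations: step along y
  }
}
p  # both coordinates are now close to the minimizer (0, 0)
```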

We define the coordinate descent algorithm iteratively as:
```{r}
# Coordinate descent
#
# Parameters:
#   x.initial: The starting x-coordinate
#   y.initial: The starting y-coordinate
#   FUN: The gradient vector for the function being optimized
#   minimum: The actual minimum of the function
#
# Returns:
#   A character vector of (termination reason, iteration count).
#
coordinate.descent <- function(x.initial, y.initial, FUN, minimum) {
  for (iterations in 1:max_iterations) {
    # Add the starting point to the path on the first iteration
    if (iterations == 1) path[[1]] <<- c(x.initial, y.initial)
    
    # Retrieve the current position from the path list
    x.i <- path[[iterations]][1]
    y.i <- path[[iterations]][2]
    
    # Check for convergence to the target minimum
    if (dist(rbind(c(x.i, y.i), minimum)) <= epsilon) return(c("Converged", iterations - 1))
    
    # Step along a single coordinate, alternating between x (odd iterations)
    # and y (even iterations)
    if ((iterations %% 2) == 1) {
      updated <- c(x.i, y.i) - rate * c(FUN(x.i, y.i)[1], 0)
    } else {
      updated <- c(x.i, y.i) - rate * c(0, FUN(x.i, y.i)[2])
    }
    
    # Store the updated position in `path`
    path[[iterations + 1]] <<- updated
  }
  
  return(c("Maximum iterations reached", max_iterations))
}
```

#### Testing on the Six-hump camel function

Running the coordinate descent method (with benchmarking):
```{r}
# Reset optimization path
reset_path()

# Run timed coordinate descent
F1.coordinate.time <- proc.time()
F1.coordinate.results <- coordinate.descent(F1.start[1], F1.start[2], F1.d, F1.minimum)
F1.coordinate.time <- proc.time() - F1.coordinate.time

# Retrieve number of iterations
F1.coordinate.iterations <- as.numeric(F1.coordinate.results[[2]])
```

Termination reason and number of iterations for coordinate descent optimization:
```{r}
cat(sprintf("Termination reason: %s\nNumber of iterations: %s", F1.coordinate.results[1], F1.coordinate.results[2]))
```

Benchmarks for coordinate descent optimization:
```{r}
print(F1.coordinate.time)
```

Plotting the contour and path of updates:
```{r, warning=FALSE, message=FALSE}
# Convert the `path` variable to a matrix
path <- matrix(unlist(path), ncol=2, byrow=TRUE)

# Display contour and optimization path
F1.coordinate.contour <- plot_ly(x=~F1.x, y=~F1.y, z=F1.z, width=800, height=500,
                      type="contour", contours=list(showlabels = TRUE)) %>%

  # Add optimization path trace
  add_trace(x=path[,1], y=path[,2], type="scatter", name="Optimization path", mode="lines") %>%

  # Add cross at target minimum position
  add_trace(x=F1.minimum[1], y=F1.minimum[2], type="scatter", name="Target minimum",
            mode="markers", marker=list(color="#5ec4a7", size=6, symbol="x")) %>%

  # Add arrow and annotation for starting position
  add_annotations(x=F1.start[1], y=F1.start[2], text="Start",
                  showarrow=TRUE, arrowhead=4, arrowsize=.5, arrowcolor='white', ax=30, ay=-30,
                  font=list(color = '#81ccc2', family='helvetica', size = 14)) %>%

  # Add arrow and annotation for target minimum
  add_annotations(x=F1.minimum[1], y=F1.minimum[2], text="Target\nminimum",
                  showarrow=TRUE, arrowhead=4, arrowsize=.5, arrowcolor='white', ax=-80, ay=-30,
                  font=list(color = '#81ccc2', family='helvetica', size = 14)) %>%

  # Adding and positioning plot title
  layout(title="CD optimization of the six-hump camel function (F1)", margin=list(l=125, r=0, t=60, b=60))

# Display the contour plot
div(F1.coordinate.contour, align="center")
```

#### Testing on the Peaks function

Running the coordinate descent method (with benchmarking):
```{r}
# Reset optimization path
reset_path()

# Run timed coordinate descent
F2.coordinate.time <- proc.time()
F2.coordinate.results <- coordinate.descent(F2.start[1], F2.start[2], F2.d, F2.minimum)
F2.coordinate.time <- proc.time() - F2.coordinate.time

# Remove NULLs from the optimization path list
F2.coordinate.path <- Filter(Negate(is.null), path)

# Retrieve number of iterations
F2.coordinate.iterations <- as.numeric(F2.coordinate.results[[2]])
```

Termination reason and number of iterations for coordinate descent optimization:
```{r}
cat(sprintf("Termination reason: %s\nNumber of iterations: %s", F2.coordinate.results[1], F2.coordinate.results[2]))
```

Benchmarks for coordinate descent optimization:
```{r}
print(F2.coordinate.time)
```

Plotting the contour and path of updates:
```{r, warning=FALSE, message=FALSE}
# Convert the `path` variable to a matrix
path <- matrix(unlist(path), ncol=2, byrow=TRUE)

# Display contour and optimization path
F2.coordinate.contour <- plot_ly(x=~F2.x, y=~F2.y, z=F2.z, width=700, height=600,
                      type="contour", contours=list(showlabels = TRUE)) %>%

  # Add optimization path trace
  add_trace(x=path[,1], y=path[,2], type="scatter", name="Optimization path", mode="lines") %>%

  # Add cross at target minimum position
  add_trace(x=F2.minimum[1], y=F2.minimum[2], type="scatter", name="Target minimum",
            mode="markers", marker=list(color="#5ec4a7", size=6, symbol="x")) %>%

  # Add arrow and annotation for starting position
  add_annotations(x=F2.start[1], y=F2.start[2], text="Start",
                  showarrow=TRUE, arrowhead=4, arrowsize=.5, arrowcolor='white', ax=30, ay=-30,
                  font=list(color = '#81ccc2', family='helvetica', size = 14)) %>%

  # Add arrow and annotation for target minimum
  add_annotations(x=F2.minimum[1], y=F2.minimum[2], text="Target\nminimum",
                  showarrow=TRUE, arrowhead=4, arrowsize=.5, arrowcolor='white', ax=-80, ay=-30,
                  font=list(color = '#81ccc2', family='helvetica', size = 14)) %>%

  # Adding and positioning plot title
  layout(title="CD optimization of the peaks function (F2)", margin=list(l=125, r=0, t=60, b=60))

# Display the contour plot
div(F2.coordinate.contour, align="center")
```

# Method comparison metrics

## Benchmarks

Preparing the benchmark comparison plot:
```{r, warning=FALSE, message=FALSE}
# Dataframe representing the gradient and coordinate descent benchmarks on F1
F1.benchmarks.df <- data.frame(
  Method=rep(c("Gradient descent", "Coordinate descent"), each=3),
  Type=c("User", "System", "Elapsed (Total)"),
  Time=round(c(F1.gradient.time[1:3], F1.coordinate.time[1:3]), digits=3)
)

# Grouped bar plot for comparison of the benchmarks
F1.benchmarks.plot <- ggplot(data=F1.benchmarks.df, aes(x=Type, y=Time, fill=Method)) +
  geom_bar(stat="identity", position="dodge") +
  geom_text(aes(label=Time), hjust=0.5, color="black", size=2.5, position=position_dodge(0.9)) +
  scale_fill_manual("Method", values=c("Gradient descent"="#70bacf", "Coordinate descent"="#eb6963")) +
  #scale_fill_brewer(palette="Paired") +
  theme_minimal() +
  ggtitle("F1 (six-hump camel function)") +
  theme(plot.margin=unit(c(0.5, 0, 0, 0), "cm")) +
  coord_flip()

# Dataframe representing the gradient and coordinate descent benchmarks on F2
F2.benchmarks.df <- data.frame(
  Method=rep(c("Gradient descent", "Coordinate descent"), each=3),
  Type=c("User", "System", "Elapsed (Total)"),
  Time=round(c(F2.gradient.time[1:3], F2.coordinate.time[1:3]), digits=3)
)

# Grouped bar plot for comparison of the benchmarks
F2.benchmarks.plot <- ggplot(data=F2.benchmarks.df, aes(x=Type, y=Time, fill=Method)) +
  geom_bar(stat="identity", position="dodge") +
  geom_text(aes(label=Time), hjust=0.5, color="black", size=2.5, position=position_dodge(0.9)) +
  scale_fill_manual("Method", values=c("Gradient descent"="#70bacf", "Coordinate descent"="#eb6963")) +
  #scale_fill_brewer(palette="Paired") +
  theme_minimal() +
  ggtitle("F2 (peaks function)") +
  coord_flip()

# Arranging the subplots
benchmarks.plot <- ggarrange(F1.benchmarks.plot, F2.benchmarks.plot, nrow=2, common.legend=TRUE, legend="bottom")
```

Plotting the benchmarks comparison:
```{r}
# Add a plot title
annotate_figure(benchmarks.plot, 
                top=text_grob("Gradient and coordinate descent benchmarks", size=16, face="bold"))
```

As explained earlier, coordinate descent tends to optimize faster (in terms of run-time) than gradient descent. Despite this, we cannot definitively conclude that coordinate descent will always be faster without performing repeated experiments and trying other functions and combinations of $\eta$, $\epsilon$ and starting position.

## Number of iterations

Preparing the iteration comparison plot:
```{r, warning=FALSE, message=FALSE}
# Dataframe representing the gradient and coordinate descent iterations on F1
F1.iterations.df <- data.frame(
  Method=c("Gradient descent", "Coordinate descent"),
  Iterations=round(as.numeric(c(F1.gradient.results[[2]], F1.coordinate.iterations)))
)

# Bar plot for comparison of iterations
F1.iterations.plot <- ggplot(data=F1.iterations.df, aes(x=Method, y=Iterations, fill=Method)) +
  geom_bar(stat="identity", position="dodge") +
  scale_fill_manual("Method", values=c("Gradient descent"="#70bacf", "Coordinate descent"="#eb6963")) +
  #scale_fill_brewer(palette="Paired") +
  geom_text(aes(label=Iterations), hjust=3, color="white", size=4, position=position_dodge(0.9)) +
  scale_y_continuous(breaks=function(x) unique(floor(pretty(seq(0, (max(x) + 1) * 1.1))))) + 
  theme_minimal() +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) +
  ggtitle("F1 (six-hump camel function)") +
  theme(plot.margin=unit(c(0.5, 0, 0, 0), "cm")) +
  coord_flip()

# Dataframe representing the gradient and coordinate descent iterations on F2
F2.iterations.df <- data.frame(
  Method=c("Gradient descent", "Coordinate descent"),
  Iterations=round(as.numeric(c(F2.gradient.results[[2]], F2.coordinate.iterations)))
)

# Bar plot for comparison of iterations
F2.iterations.plot <- ggplot(data=F2.iterations.df, aes(x=Method, y=Iterations, fill=Method)) +
  geom_bar(stat="identity", position="dodge") +
  scale_fill_manual("Method", values=c("Gradient descent"="#70bacf", "Coordinate descent"="#eb6963")) +
  #scale_fill_brewer(palette="Paired") +
  geom_text(aes(label=Iterations), hjust=3, color="white", size=4, position=position_dodge(0.9)) +
  scale_y_continuous(breaks=function(x) unique(floor(pretty(seq(0, (max(x) + 1) * 1.1))))) + 
  theme_minimal() +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) +
  ggtitle("F2 (peaks function)") +
  coord_flip()

# Arranging the subplots
iterations.plot <- ggarrange(F1.iterations.plot, F2.iterations.plot, nrow=2, common.legend=TRUE, legend="bottom")
```

Plotting the iterations comparison:
```{r}
# Add a plot title
annotate_figure(iterations.plot, 
                top=text_grob("Gradient and coordinate descent iterations", size=16, face="bold"))
```

Unlike benchmarks, the number of iterations that each optimization method takes to converge (if it converges at all) does not vary between runs---it only changes if we alter $\eta$, $\epsilon$ or the starting position, which we did not. For this reason, repeated experiments are less necessary here than for benchmarking, where variability in CPU resource allocation alone is enough to warrant repetition.

However, it still would not be fair to draw a firm conclusion from these results, since we would need to test the optimization methods on different objective functions and different combinations of $\eta$, $\epsilon$ and starting positions.

Nevertheless, on these tests, coordinate descent required far more iterations than gradient descent to reach the target minima.

# Resources

[^l-bfgs]: https://en.wikipedia.org/wiki/Limited-memory_BFGS

[^glmnet]: https://web.stanford.edu/~hastie/TALKS/glmnet.pdf

[^coordinate-descent]: https://en.wikipedia.org/wiki/Coordinate_descent

[^camel]: https://www.sfu.ca/~ssurjano/camel6.html

[^peaks]: https://al-roomi.org/benchmarks/unconstrained/2-dimensions/63-peaks-function

[^derivative-free]: https://en.wikipedia.org/wiki/Derivative-free_optimization

[^nelder-mead]: https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method

[^proc-time]: https://stat.ethz.ch/R-manual/R-devel/library/base/html/proc.time.html

[^gradient-descent]: https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html